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Abstract The advent of large data-set in cosmology has meant that in the past 10 
or 20 years our knowledge and understanding of the Universe has changed not only 
quantitatively but also, and most importantly, qualitatively. Cosmologists are inter- 
ested in studying the origin and evolution of the physical Universe. They rely on data 
where a host of useful information is enclosed, but is encoded in a non-trivial way. 
The challenges in extracting this information must be overcome to make the most of 
the large experimental effort. Even after having analyzed a decade or more of data 
and having converged to a standard cosmological model (the so-called and highly 
successful LCDM model) we should keep in mind that this model is described by 10 
or more physical parameters and if we want to study deviations from the standard 
model the number of parameters is even larger. Dealing with such a high dimen- 
sional parameter space and finding parameters constraints is a challenge on itself. 
In addition, as gathering data is such an expensive and difficult process, cosmolo- 
gists want to be able to compare and combine different data sets both for testing 
for possible disagreements (which could indicate new physics) and for improving 
parameter determinations. Finally, always because experiments are so expansive, 
cosmologists in many cases want to find out a priori, before actually doing the ex- 
periment, how much one would be able to learn from it. For all these reasons, more 
and more sophisiticated statistical techniques are being employed in cosmology, and 
it has become crucial to know some statistical background to understand recent lit- 
erature in the field. Here, I will introduce some statistical tools that any cosmologist 
should know about in order to be able to understand recently published results from 
the analysis of cosmological data sets. I will not present a complete and rigorous 
introduction to statistics as there are several good books which are reported in the 
references. The reader should refer to those. I will take a practical approach and 
I will touch upon useful tools such as statistical inference, Bayesians vs Frequen- 
tist approach, chisquare and goodness of fit, confidence regions, likelihood, Fisher 
matrix approach, Monte Carlo methods and a brief introduction to model testing. 
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Throughout, I will use practical examples often taken from recent literature to illus- 
trate the use of such tools. Of course this will not be an exhaustive guide: it should 
be interpreted as a "starting kit", and the reader is warmly encouraged to read the 
references to find out more. 



1 Introduction 

As cosmology has made the transition from a data-starved science to a data-driven 
science, the use of increasingly sophisticated statistical tools has increased. As ex- 
plained in detail below, cosmology is intrinsically related to statistics, as theories of 
the origin and evolution of the Universe do not predict, for example, that a particu- 
lar galaxy will form at a specific point in space and time or that a specific patch of 
the cosmic microwave background will have a given temperature: any theory will 
predict average statistical properties of our Universe, and we can only observe a 
particular realization of that. 

It is often said that cosmology has entered the precision era: "precision" requires 
a good knowledge of the error-bars and thus confidence intervals of a measurement. 
This is an inherently statistical statement. We should try however to go even fur- 
ther, and achieve also "accuracy" (although cosmology does not have a particularly 
stellar track record in this regard). This requires quantifying systematic errors (be- 
yond the statistical ones) and it is also requires statistical tools. For all these reasons, 
knowledge of basic statistical tools has become indispensable to understand the re- 
cent cosmological literature. 

Examples of applications where probability and statistics are crucial in Cosmol- 
ogy are: i) Is the universe homogenous and isotropic on large scales? ii) are the 
initial conditions consistent with being Gaussian? Hi) is there a detection of non- 
zero tensor modes? iv) what is the value of the density parameter of the Universe 
£2 m given the WMAP data for a LCDM model? v) what are the allowed values at a 
given confidence level for the primordial power spectrum spectral slope nl vi) what 
is the best fit value of the dark energy equation of state parameter w? vii) Is a model 
with equation of state parameter different from - 1 a better fit to the data than a model 
with non-zero curvature? viii) what will be the constraint on the parameter w for a 
survey with given characteristic? 

The first three questions address the hypothesis testing issue. You have an hy- 
pothesis and you want to check wether the data are consistent with it. Sometimes, 
especially for addressing issues of "detection" you can test the null hypothesis: as- 
sume the quantity is zero and test wether the data are consistent with it. 

The next three questions are "parameter estimation" problems: we have a model, 
in this example the LCDM model, which is characterized by some free parameters 
which we would like to measure. 

The next question, vii), belong to "model testing": we have two models and ask 
which one is a better fit to the data. Model testing comes in several different flavors: 
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the two models to be considered may have different number of parameters or equal 
number of parameters, may be have some parameters in common or not etc. 

Finally question viii) is on "forecasting", which is particularly useful for or 
quickly forecasting the performance of future experiments and for experimental de- 
sign. 

Here we will mostly concentrate in the issue of parameter estimation but also 
touch upon the other applications. 



2 Bayesian vs Frequentists 

The world is divided in Frequentists and Bayesians. For Frequentsists probabilities 
are frequencies of occurence: 



where n denotes the number of successes and N the total number of trials. Frequen- 
tists define probability as the limit for the number of independent trials going to 
infinity. Bayesians interpret probabilities as degree of belief in a Hypothesis. 

Let us say that x is our random variable (event). Depending on the application, 
x can be the number of photons hitting a detector, the matter density in a volume, 
the Cosmic Microwave Background temperature in a direction in the sky, etc. The 
probability that x takes a specific value is £?(x) where 3? is called probability dis- 
tribution. Note that probabilities (the possible values of x) can be discrete or con- 
tinuous. &(x) is a probability density: £P(x)dx is the probability that the random 
variable x takes a value between x and x + dx. Frequentists only consider probability 
distributions of events while Bayesians consider hypothesis as events. 

For both, the rules of probability apply. 

• 1. ^>(x)>0 

• 2. f'^ 00 dx3 g (x) = 1. In the discrete case / — > £. 

• 3. For mutually exclusive events &(x\Ux2) = 3P(x\.OR.X2) = &(x\) + S?(x2) 

• 4. In general 3^{x\,X2) = 3^{x\)3 a {x\\x2). In words: the probability of x\ AND 
%2 to happen is the probability of x\ times the conditional probability of X2 given 
thatjti has already happened. 

The last item deserves some discussion. Example here. Only for independent events 
where S?{xt\x\) = S g {x2) one can write ^{x\,xj) = ,9 > {x\)@ > {x2)- Of course in 
general one can always rewrite £?(x\,X2) — 8P(x\)3P(x\\x2) by switching xjand 
X2- If then one makes the apparently tautological identification that 3 g (x\,X2) = 
£P(x2,xi) and substitute x\ — > D standing for data and X2 — ► H standing for hy- 
pothesis, one gets Bayes theorem : 



(2) 
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S?(H\D) is called the posterior, &>(D\H) is the likelihood (the probability of the 
data given the hypothesis, and &(H) is called the prior. Note that here explicitly 
we have probability and probability distribution of a hypothesis. 



3 Bayesian approach and statistical inference 

Despite its simplicity, Bayes theorem is at the base of statistical inference. For the 
Bayesian point of view let us use D to indicate our data (or data set). The hypoth- 
esis H can be a model, say for example the LCDM model, which is characterized 
by a set of parameters 6. In the Bayesian framework what we want to know is 
"what is the probability distribution for the model parameters given the data?" i.e. 
&(9\D). From this information then we can extract the most likely value for the 
parameters and their confidence limitfl However what we can compute accurately, 
in most instances, is the likelihood, which is related to the posterior by the prior. 
(At this point one assumes one has collected the data and so &{D) = 1). The prior 
however can be somewhat arbitrary. This is a crucial point to which we will return 
below. For now let us consider an example: the constraint from WMAP data on the 
integrated optical depth to the last scattering surface T. One could do the analysis 
using the variable T itself, however one could also note that the temperature data 
(the angular power spectrum of the temperature fluctuations) on large scale depend 
approximately linearly on the variable Z = exp(— 2t). A third person would note 
that the polarization data (in particular the EE angular power spectrum) depends 
roughly linearly on T 2 . So person one could use a uniform prior in T, person two 
a uniform prior in exp(— 2f) and person three in T 2 . What is the relation between 
£»(t), ^(Z)and ^(t 2 )? 



3.1 Transformation of variables 

We wish to transform the probability distribution of S?(x) to the probability distri- 
bution of §f (y) with y a function of x. Recall that probability is a conserved quantity 
(we can't create or destroy probabilities...) so 



&>(x)dx = y{y)dy 



(3) 



thus 



&>(x)=&(y(x)) 



dy 
dx 



(4) 



1 At this point many Frequentists stop reading this document. 
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Following the example above if x is 1 and y is exp(r) then & is related to Sf by 
a factor 2t, and if y is T 2 by a factor 2. In other words using different priors leads to 
different posteriors. This is the main limitation of the Bayesian approach. 



3.2 Marginalization 

So far we have considered probability distributions of a random variable x, but one 
could analogously define multivariate distributions , the joint probability distri- 
bution of two or more variables e.g., ^(x,y). A typical example is the description 
of the initial distribution of the density perturbations in the Universe. Motivated by 
inflation and by the central limit theorem, the initial distribution of density perturba- 
tion is usually described by a multi-variate Gaussian: at every point in space given 
by its spatial coordinates (x, y, z), is taken to be a random Gaussian distribution. 
Another example is when one simultaneously constrains the parameters of a model, 
say for example 9 = {£2 m , Hq} (here Hq denotes the Hubble constant). If you have 
<5^(i2 m ,//o) and want to know the probability distribution of £2 m regardless of the 
values of Hq then: 



3.3 Back to statistical inference and Cosmology 

Let us go back to the issue of statistical inference and follow the example from (TJ. 
If you have an urn with N red balls and M blue balls and you draw one ball at the 
time then probability theory can tell you what are your chances of picking a red ball 
given that you have already drawn n red and m blue: &(D\H). However this is not 
what you want to do: you want to make a few drawn from the urn and use probability 
theory to tell you what is the red vs blue distribution inside the urn: &(H\D). In the 
Frequentist approach all you can compute is 3P(D\H). 
In the case of cosmology it gets even more complicated. 

We consider that the Universe we live in is a random realization of all the possible 
Universes that could have been a realization of the true underlying model (which is 
known only to Mother Nature). All the possible realizations of this true underlying 
Universe make up the ensamble. In statistical inference one may sometime want 
to try to estimate how different our particular realization of the Universe could be 
from the true underlying one. Going back to the example of the urn with red and 
blue balls, it would as if we were to be drawing from one particular urn, but the urn 
is part of a large batch. On average, the batch distribution has 50% red and 50% 
blue, but each urn has only an odd number of balls so any particular urn cannot 
reflect exactly the 50-50 spilt. 
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A crucial assumption of standard cosmology is that the part of the Universe that 
we can observe is a fair sample of the whole. But the peculiarity in cosmology is 
that we have just one Universe, which is just one realization from the ensemble 
(quite fictitious one: it is the ensemble of all possible Universes). The fair sample 
hypothesis states that samples from well separated parts of the Universe are inde- 
pendent realizations of the same physical process, and that, in the observable part 
of the Universe, there are enough independent samples to be representative of the 
statistical ensemble. 

In addition, experiments in cosmology are not like lab experiments: in many 
cases observations can't easily be repeated (think about the observation of a partic- 
ular Supernova explosion or of a Gamma ray burst) and we can't try to perturb the 
Universe to see how it reacts... After these considerations, it may be clearer why 
cosmologists tend to use the Bayesian approach. 



4 Chisquare & goodness of fit 

Say that you have a set of observations and have a model, described by a set of 
parameters 9, and want to fit the model to the data. The model may be physically 
motivated or a convenient function. One then should define a merit function, quanti- 
fying the agreement between the model and the data, by maximizing the agreement 
one obtains the best fit parameters. Any useful fitting procedure should provide: 1) 
best fit parameters 2) estimate of error on the parameters 3) possibly a measure of 
the goodness of fit. One should bear in mind that if the model is a poor fit to the data 
then the recovered best fit parameters are meaningless. 

Following Numerical recipes ( JSJ, Chapter 15) we introduce the concept of model 
fitting (parameter fitting) using least- squares. Let us assume we have a set of data 
points D,-, for example these could be the band-power galaxy power spectrum at a set 
of k values, and a model for these data , y(x, 9) which depends on set of parameters 9 
(e.g. the LCDM power spectrum, which depends on n^-primordial power spectrum 
spectral slope-, Og -present-day amplitude of rms mass fluctuations on scale of 8 
Mpc/h-, Q m h etc.). Or it could be for example the supernovae type la distance 
modulus as a function of redshift see e.g. Fig.© J2] [3]. 

The least squares, in its simplest incarnation is: 

X 2 = ^i[Di-y(xi\9)} 2 (6) 

i 

where w, are suitably defined weights. It s possible to show that the minimum vari- 
ance weight is vv, = 1 j a} where a,- denote the error on data point ;. In this case 
then the least squares is called Chisquare. If the data are correlated the chisquare 
becomes: 

X 2 = ZiDi-yixiWQijiDj-yixjW (7) 

y 



Statistical methods in cosmology 



7 




0.2 0.4 0.6 0.8 1.0 1.2 1.4 




Fig. 1 Left: distance modulus vs redshift for Supernovae type 1A from the UNion sample (2j. 
Right: bandpower P(k) for DR5 SDSS galaxies, from 0. In both cases one may fit a theory (and 
the theory parameters) to the data with the Chisquare method. Note that in both cases errors are 
correlated. In the right panel the errors are also strictly speaking not Gaussianly distributed. 



where Q denotes the inverse of the so-called covariance matrix describing the co- 
variance between the data. The best fit parameters are those that minimize the % 2 . 
See an example in Fig|2] 

For a wide range of cases the probability distribution for different values of % 2 
around the minimum of Eq|7]is the % distribution for V = n — m degrees of free- 
dom where n is the number of independent data points and m the number of pa- 
rameters. The probability that the observed % 2 exceeds by chance a value % for 
the correct model is Q(v,x) = 1 — -T(v/2,^/2) where F denotes the incomplete 
Gamma function. See the Numerical Recipes bible (6). Conversely, the probabil- 
ity the the observed J 2 , even for the correct model, is less than % is 1 — Q. While 
this statement is strictly true if measurement errors are Gaussian and the model is 
a linear function of the parameters, in practice it applies to a much wider rangeof 
cases. 

The quantity Q evaluated the the minimum chisquare (i.e. at the best fit values 
for the parameters) give a measure of the goodness of fit. If Q gives a very small 
probability then there are three possible explanations: 

1) the model is wrong and can be rejected. (Strictly speaking: the data are unlikely 
to have happened if the Universe was really described by the model considered) 

2) the errors are underestimated 

3) the measurement errors are non Gaussianly distributed. 

Note that in the example of the power spectrum we know a priori that the errors 
are not Gaussianly distributed. In fact, even if the initial conditions were Gaussian 
and if the underlying matter perturbations were evolving still in the linear regime 
(i.e.<5p /p < 1) and galaxies were nearly unbiased tracers of the dark matter, then 
the density fluctuation itself would obey Gaussian statistics and so would its Fourier 
transform, but not its power spectrum, which is a square quantity. In reality we 
now that by z = perturbations grow non-linearly and that galaxies may not be 
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nearly unbiased tracers of the underlying density field. Nevertheless, the Central 
Limit theorem comes to our rescue, if in each band-power there is a sufficiently 
large number of modes. 

If Q is too large (too good to be true) it is also cause for concern: 

1) errors have been overestimated 

2) data are correlated or non-independent 

3) the distribution is non Gaussian 
Beware: this last case is very rare. 

A useful "chi-by-eye" rule is: the minimum % 2 should be roughly equal to V 
(number of data - number of parameters). This is increasingly true for large v. From 
this, it is easy to understand the use of the so-called "reduced chisquare" that is the 
Xmin/ m - ^ m n (i- e -' num ber of data much larger than the number of parameters 
to fit, which should be true in the majority of the cases) then m ~ v and the rule of 
thumb is that reduced chisquare should be unity. 

Note that the chisquare method, and the Q statistic, give the probability for the 
data given a model &{D\9), not &(6\D). One can make this identification via the 
prior. 



5 Confidence regions 

Once the best fit parameters are obtained, how can one represent the confidence 
limit or confidence region around the best fit parameters? A reasonable choice is to 
find a region in the ra-dimensional parameter space (remember that m is the number 
of parameters) that contain a given percentage of the probability distribution. In 
most cases one wants a compact region around the best fit vales. A natural choice 
is then given by regions of constant % 2 boundaries. Note that there may be cases 
(when the % 2 has more than one minimum) in which one may need to report a 
non-connected confidence region. For multi-variate Gaussian distributions however 
these are ellipsoidal regions. Note that the fact that the data have Gaussian errors 
does not imply that the parameters will have a Gaussian probability distribution... 

Thus, if the values of the parameters are perturbed from the best fit, the % 2 will 
increase. One can use the properties of the % 2 distribution to define confidence in- 
tervals in relation to ^variations or Ax 2 - Table 1 reports the Ax 2 for 68.3%, 95.4% 
and 99.5% confidence levels as function of number of parameters for the joint con- 
fidence level. In the case of Gaussian distributions these correspond to the conven- 
tional 1, 2 and 3 a. Se an example of this in Fig|2] 

Beyond these values here is the general prescription to compute constant-^ 2 
boundaries confidence levels. After having found the best fit parameters by mini- 
mizing the x 2 an d if Q f° r the best fit parameters is acceptable then: 

• 1 Let m be the number of parameters, n the number of data and p be the confi- 
dence limit desired. 

• 2 Solve the following equation for Ax 2 '- 
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Fig. 2 Left: example of a one-dimensional chisquare for a Gaussian distribution as a function of a 
parameter and corresponding 68.3%, 95.4% and 99.5% confidence levels. Right a two dimensional 
example for the Union supernovae data. Figure from Kowlaski et al. (2009)(5J reproduced by 
permission of the AAS. Note that in a practical application even if the data have gaussian errors the 
errors on the parameter may not be well described by multi-variate Gaussians (thus the confidence 
regions are not ellipses). 



Q(n-m 7 min(x) + Ax) = p (8) 

• 3 Find the parameter region where % 2 < min(% 2 ) + Ax 2 - This defines the confi- 
dence region. 



Table 1 Ax 2 for the conventionals 1,2, and 3 — a as a function of the number of parameters for 
the joint confidence levels. 



p 


1 


2 


3 


68.3% 


1.00 


2.30 


3.53 


95.4% 


2.71 


4.61 


6.25 


99.73% 


9.00 


11.8 


14.2 



If the actual error-distribution is non Gaussian but it is known then it is still 
possible to use the % 2 approach, but instead of using the chisquare distribution and 
table [TJ the distribution need to be calibrated on multiple simulated realization of 
the data as illustrated below in the section dedicated to Monte Carlo methods. 



6 Likelihood 

So far we have dealt with the frequentist quantity £P(D\H), If we set 3P(D) = 1 and 
ignore the prior then we can identify the likelihood with ZP{H\D) and thus by max- 
imizing the likelihood we can find the most likely model (or model's parameters) 
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given the data. However having ignored S?(D) and the prior this approach cannot 
give in general a goodness of fit and thus cannot give an absolute probability for a 
given model. However it can give relative probabilities. If the data are Gaussianly 
distributed the likelihood is given by a multi-variate Gaussian: 



^ (2^)«/2|J efC | 1 /2 eXP 



-W{D-y) i C n \D-y) ] 
L ij 



(9) 



where Cy = ((£),- — yj)(Dj — yj)) is the covariance matrix. 

It should be clear from this that the relation between % 2 and likelihood is that, for 
Gaussian distributions, _Sf <*; exp[— l/2x 2 } and minimizing the % 2 is equivalent at 
minimizing the likelihood. In this case likelihood analysis and % 2 coincide and by 
the end of this section, it will this be no surprise that the Gamma function appearing 
in the % 2 distribution is closely related to the Gaussian integrals. 

The subtle step is that now, in Bayesian statistics, confidence regions are regions 
R in model space such that J R £?(9\D)dQ = p where p is the confidence level we 
request (e.g., 68.3%, 95.4% etc.). Note that by integrating the posterior over the 
model parameters, the confidence region depends on the prior information: as seen 
in i j3.1l different priors give different posteriors and thus different regions R. 

It is still possible to report results independently of the prior by using the Like- 
lihood ratio. The likelihood at a particular point in parameter space is compared 
with that at the best fit value, Jz? max where likelihood id maximized. Thus a model is 
acceptable if the likelihood ratio 



A = -2 In 



J2f(0) 



(10) 



is above a given threshold. The connection to the % 2 for Gaussian distribution should 
be clear. In general, the threshold can be calibrated by calculating the entire distri- 
bution of the likelihood ratio in the case that a particular model is the true model. 
Frequently this is chosen to be the best ft model. 

There is a subtlety to point out here. In cosmology the data may be Gaussainly 
distributed and still the % 2 and likelihood ratio analysis may give different results. 
This happens because in identifying likelihood and chisquare we have neglected the 
term [(27r)"/ 2 |JefC| 1 / 2 ]- 1 . If the covariance does not depend on the model or model 
parameters, this is just a normalization factor which drops out in the likelihood ratio. 
However in cosmology often the covariance depends on the model: this happens for 
example when errors are dominated by cosmic variance, like in the case of the CMB 
temperature fluctuations on the largest scales, or on the galaxies power spectrum on 
the largest scales. In this case the cosmology dependence of the covariance cannot 
be neglected, but one can always define a pseudo-chisquare as — 21n^f and work 
with this quantity. 

Let us stress again that the likelihood is linked to the posterior through the prior: 
the identification of the likelihood with the posterior is prior dependent (as we will 
see in an example below). In the absence of any data it is common to assume a flat 
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(uniform) prior: i.e. all value of the parameter in question are equally likely, but other 
choices are possible and sometimes more motivated. For example, if a parameter is 
positive-definite, it may be interesting to use a logaritmic prior (uniform in the log). 

Priors may be assigned theoretically or from prior information gathered from 
previous experiments. If the priors are set by theoretical considerations, it is always 
good practice to check how much the results depend on the choice of the prior. If 
the dependence is significant, it means that the data do not have much statistical 
power to constrain that (those) parameter(s). Information theory helps us quantify 
the amount of "information gain" : the information in the posterior relative to the 
prior: 

~@>{e\D)~ 



J 3»(6\D)\og 



(0) 



d6 



(11) 



6.1 Marginalization: examples 

Some of the model parameters may be uninteresting. For example, in many analyses 
one wants to include nuisance parameters (calibration factors, biases, etc.) but then 
report the confidence level on the real cosmological parameters regardless of the 
value of the nuisance ones. In other cases the model may have say, 10 or more real 
cosmological parameters but we may be interested in the allowed range of only one 
or two of them, regardless of the values of all the other. Typical examples are e.g., 
constraints on the curvature parameter (which we may want to know regardless 
of the values of e.g., Q m or Qj\) or, say, the allowed range for the neutrino mass 
regardless of the power spectrum spectral index or the value of the Hubble constant. 
As explained in i ]3.2l one can marginalize over the uninteresting parameters. 

It should be kept in mind that marginalization is a Bayesian concept: the results 
may depend on the prior chosen. 

In some cases, the marginalization can be carried out analytically. An example is 
reported below: this applies to the case of e.g., calibration uncertainty, point sources 
amplitude, overall scale independent galaxy bias, magnitude intrinsic brightness or 
beam errors for CMB studies. In this case it is useful to know the following results 
for Gaussian likelihoods: 



^{e,..e m ^\DU 



dA 



{2%)^\\C\Y- 



1 



exp 



\ (C,-(Q+Ai' i ))Xy 1 (Cj - (Cj+APj)) 



I (A -A) 



(12) 



repeated indices are summed over and | \C\ | denotes the determinant. Here, A is the 
amplitude of, say, a point source contribution P to the Q angular power spectrum, A 
is the m ,h parameter which we want to marginalize over with a Gaussian prior with 
variance a 1 around A. The trick is to recognize that this integral can be written as: 
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(01..0 m _l|D) =C exp 



--C1-2C2A + C3A- 



dA (13) 



(where Co. ..3 denote constants and it is left as an exercise to write them down 
explicitly) and that this kind of integral is evaluated by using the substitution 
A — ► A — C2/C3 giving something °= exp[— \/2{C\ —C\jC^)\. 

In cases where the likelihood surface (describing the value of the likelihood as 
a function of the parameters) is not a multi-variate Gaussian, the location of the 
maximum likelihood before marginalization may not coincide with the location after 
marginalization. An example is shown in Figf5] The figure show the probability 
distribution for £2^ form WMAP5 data for a model where curvature is free and the 
equation of state parameter for dark energy w is constant in time but not fixed at — 1 . 
The red line shows the N-dimensional maximum posterior value and the black line 
is the marginalized posterior over all other cosmological parameters. 

It should also be added that, even in the case where we have a single-peaked pos- 
terior probability distribution there are two common estimators of the "best" param- 
eters: the peak value (i.e. the most probable value) or the mean, = f dOO £?(0\D). 
If the posterior is non-Guassian these two estimates need not to coincide. In the same 
spirit, slightly different definitions of confidence intervals need not to coincide for 
non-Gaussian likelihoods, as illustrated in the right panel of Fig. [3] for example one 
can define the confidence interval [0/ OH , 0/ !IS /,], such that equal fractions of the pos- 
terior volume lie in (— °°, 0/„„) and idhighi°°)- This is called central credible interval 
and is connected to the median. Another possibility (minimum credible interval) is 
to consider the region so that the posterior at any point inside it is larger that at any 
point outside and so that the integral of the posterior in this region is the required 
fraction of the total. Thus remember, it is always good practice to declare what con- 
fidence interval one is using! This subject is explored in more details in e.g., J9). 



7 Why Gaussian Likelihoods? 

Throughout this lectures we always refer to Gaussian likelihoods. It is worth men- 
tioning that if the data errors are Gaussianly distributed then the likelihood function 
for the data will be a multi-variate Gaussian. If the data are not Gaussianly dis- 
tributed (but still are drawn from a distribution with finite variance!) we can resort 
to the central limit theorem: we can bin the data so that in each bin there is a super- 
position of many independent measurement. The central limit theorem will tell us 
that the resulting distribution (i.e. the error distribution for each bin) will be better 
approximated by a multi-variate Gaussian. However, as mentioned before, even if 
the data are Gaussianly distributed this does not ensure that the likelihood surface 
for the parameters will be a multi-variate Gaussian: for this to be always true the 
model needs to depend linearly on the parameters. Even without resorting to the 
central limit theorem, the Gaussian approximation is in many cases recovered even 
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Fig. 3 Marginalization effects. Top panel: We consider the posterior distribution for the cosmo- 
logical parameters of a dark energy + cold dark matter model where curvature is a free parameter 
and so is a (constant) equation of state parameter for dark energy. The data are the WMAP 5 year 
data. The red line shows the N-dimensional maximum posterior value and the black line is the 
marginalized posterior over all other cosmological parameters. Figure courtesy of LAMBDA [?]. 
Bottom panel: figure from (5). Illustration of Central Credible Interval (CCI) and Minimum Cred- 
ible Interval (MCI), for the case of a LCDM model with free number of effective neutrino species 
(ignore blue dotted line for this example, red line is the marginalized posterior). 



when starting from highly non-gaussian distribution. A neat example is provided by 
Cash [8 1 which we follow here. 

Let's say you want to constrain cosmology by studying clusters number counts 
as a function of redshift. The observation of a discrete number N of clusters is a 
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Poisson process, the probability of which is given by the product 



e i [<'exp(- e; )/»,!] 



(14) 



where «, is the number of clusters observed in the ; — t h experimental bin and e, 
is the expected number in that bin in a given model: e,- = I(x)8xj with i being the 
proportional to the probability distribution. Here 5x, can represent an interval in 
clusters mass and/or redshift. Note: this is a product of Poisson distributions, thus 
one is assuming that these are independent processes. Clusters may be clustered, so 
when can this be used? 

For unbinned data (or for small bins so that bins have only and 1 counts) we 
define the quantity: 



where E is the total expected number of clusters in a given model. The quantity AC 
between two models with different parameters has a % 2 distribution! (so all that was 
said in the % 2 section applies, even though we started from a highly non-Gaussian 
distribution.) 



8 The effect of priors: examples 

Let us consider the two figures in Fig. [4] On the left: WMAP 1st year data constraints 
in the Q m , Qa plane. On the right: models consistent with the WMAP 3 yr data. In 
both cases the model is a non-flat LCDM model. So why the addition of more data 
(the two extra years of WMAP observations) gives worst constraints? The key is 
that what is reported in the plots is a representation of the posterior probability 
distribution. In the left panel a flat prior on &a (angular size distance to the last 
scattering surface, giving by the position of the first peak) was assumed. In the 
figure on the right a flat prior on the Hubble constant Hq was assumed. Remember: 
always declare the priors assumed! 



9 Combining different data sets: examples 

It has become common to "combine data sets" and explore the constraints from the 
"data set combination". This means in practice that the likelihoods can be multiplied 
if the data sets are independent (if not the one should account for the appropriate 
covariance). it is important to note that: If the data-sets are inconsistent, the resulting 
constraints from the combined data set are nonsense. An example is shown in Fig. [5] 
On the left panel we show a figure from Q constraints in the Q m , Cg plane for 
a flat LCDM model for WMAP 3yr data (blue), weak lensing constraints (orange) 
and combined constraints. On the right panel the figure shows the constraints in the 



N 



C=E-21n^ = 2(£-^ln/i) 



(15) 





Fig. 4 Top: WMAP 1st year data constraints in the £l m , Q.^ plane, from Spergel et al 2003, ApJS, 
148:175-194 [4), Bottom: models consistent with the WMAP 3 yr data, from Spergel et al. (2007) 
ApJS, 170, 377(5)- In both cases the model is a non-flat LCDM model. Figures reproduced by 
permission of the AAS. 



£2k> w plane for non-flat dark energy models with constant w for WMAP5+ super- 
novae data (in black) and WMAP5+BAO (in red). Even though the WMAP data 
are in common there is some tension in the resulting constraints. The two data sets 
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WMAP + Weak Lensing 




-0.06 - 



-0.12 



Fig. 5 Left: constraints in the Q m o% plane for a flat LCDM model for WMAP 3yr data (blue), 
weak lensing constraints (orange) and combined constraints.Figure from Spergel et al. 2003 00, 
reproduced by permission of the AAS. Right:Constraints in the £2t,w plane for non-flat dark energy 
models with constant w for WMAP5+ supernovae data (in black) and WMAP5+BAO (in red). 
Figure courtesy of LAMBDA Q. 



(Supernovae and BAO, WMAP and weak lensing ) are not fully consistent: as the 
authors themselves, note, they should not be combined. 



10 Forecasts: Fisher matrix 

Before diving into the details let us re-examine of error estimates for parameters 
from the likelihood. Let us assume a flat prior in the parameter so we can identify 
the posterior with the likelihood. Close to the peaks we can expand the log likelihood 
in Taylor series: 



in^in^(e o) + -p*-e, , ^ 



(8j-e j0 ) + ... (16) 



by truncating this expansion to the quadratic term (remember that by expanding 
around the maximum we have the first derivative equal to zero) we say the the 
likelihood surface is locally a multi-variate Gaussian. The Hessian matrix is defined 
as 

dfhdOj 



^H = -^77T- d7) 
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It enclose information on the parameters errors and their covariance. If this matrix is 
not diagonal it means that the parameters estimates are correlated. Loosely speaking 
we said "the parameters are correlated": it means that they have a similar effect on 
the data and thus the data have hard time in telling them apart. The parameters may 
or may not be physically related with each other. 

More specifically if all parameters are kept fixed except one (parameter /,say), 
the error on that parameter would be given by 1/ \fM\i. This is called conditional 
error but is almost never used or interesting. 

Having understood this, we can move on to the Fisher information matrix ifTOl . 
The Fisher matrix plays a fundamental role in forecasting errors from a given ex- 
perimental set up and thus is the work-horse of experimental design. It is defined 

It should be clear that F = ( . 

Here the average is the ensamble average over observational data (those that 
would be gathered if the real Universe was given by the model -and model parameters- 
around which the derivative is taken). Since, as we have seen the likelihood for 
independent data sets is the product of the likelihoods, it follows that the Fisher ma- 
trix for independent data sets is the sum of the individual Fisher matrices. This will 
become useful later on. 

In the one-parameter case-say only ; component of 6, thinking back at the Taylor 
expansion around the maximum of the likelihood we have that 

A\nSf=^Ftt(di-§if (19) 

when 2A In Jz? = 1 and by identifying it with the A% 2 corresponding to 68% con- 
fidence level, wee see that l/\/Fu yields the 1 — a displacement for 0,. This is the 
analogous to the conditional error from above. In the general case: 

°y > (20) 

Thus when all parameters are estimated simultaneously from the data the marginal- 
ized error is 

^>(F- l )f (21) 

Let's spell it out for clarity: this is the square root of the element ii of the inverse 
of the Fisher information matrix^ This assumes that the likelihood is a Gaussian 
around its maximum (the fact that the data are Gaussianly distributed is no guar- 
antee that the likelihood will be Gaussian, see e.g. Fi^2j. The terrific utility of the 
Fisher Information matrix is that, if you can compute it, it enables you to estimate 
the parameters errors before you do the experiment. If it can be compute it quickly, 
it also enables one to explore different experimental set ups and optimize the exper- 



2 i.e. you have to perform a matrix inversion first. 
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Fig. 6 Marginalised 68% CL constraints on the dark energy pa rameters expected for the DUNE 
weak lensing (blue), a full sky BAO survey (red) and their combination (solid green). This figure 
was derived using the Fisher matrix routines of iCosmo. Figure From Refregier et al 2008. 

iment. This is why the Fisher matrix approach is so useful in survey design. Also 
complementarity of different, independent and uncorrelated experiments (i.e. how 
in combination they can lift degeneracies) can be quickly explored: the combined 
Fisher matrix is the sum of the individual matrices. This is of course extremely 
useful, however read below for some caveats. 

The > is the Kramer-Rao inequality: the Fisher matrix approach always gives you 
an optimistic estimate of the errors (reality is only going to be worst). And this is not 
only because systematic and real world effects are often ignored in the Fisher infor- 
mation matrix calculation, but for a fundamental limitation: only if the likelihood is 
Gaussian that > becomes =. In some cases, when the Gaussian approximation for 
the Likelihood does not hold, it is possible to make non-linear transformation of the 
parameter that make the likelihood Gaussian. Basically, if the data and Gaussianly 
distributed and the model depends linearly on the parameters then the likelihood 
would be Gaussian. So the key is to have a good enough understanding of the theo- 
retical model to be able to find such a transformation. See ifTTI for a clear example. 
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10.1 Computing Fisher matrices 



The simplest, brute force approach to compute a Fisher matrix is as follows: write 
down the likelihood for the data given the model. Instead of the data values (which 
are not known) use the theory prediction for a fiducial model. This will add a con- 
stant term to the log likelihood which does not depend on cosmology. In the co- 
variance matrix include expected experimental errors. Then take derivatives with 
respect to the parameters as indicated in Eq. [18] 

In the case where the data are Gaussianly distributed it is possible to compute 
explicitly and analytically the Fisher matrix, in a much more elegant way than above. 

Fij = l -Tr [C^CiC-'Cj + C^Mij] (22) 

where Mjj = y,iy T j + y.jy T j and , i denotes derivative wit respect to the parameter 0,. 
This is extremely useful: you need to know the covariance matrix (which may de- 
pend on the model and need not to be diagonal) and you need to have a fiducial 
model y which you know how it depends on the parameter 0. Then the Fisher ma- 
trix give you the expected (forecasted) errors. Priors or forecasts results from other 
experiments can be easily included by simply adding their Fisher before performing 
the matrix inversion to obtain the marginal errors. This is illustrated in Figj6] from 
ifPUl and produced using the icosmo ( http://www.icosmo.org/ll software. 



Before we finish this section let us spell out the following prescription. 

Imagine you have compute a large Fisher matrix, varying all parameters Q^, wo, 
neutrino mass m v , number of neutrino species N v , running of the spectral index a 
etc. Now you want to compute constraints for a standard flat LCDM model. Simply 
ignore row and columns corresponding to the parameters that you want to keep fixed 
at the fiducial value before inverting the matrix. 

Imagine now that you have a 6 parameters Fisher matrix (say Hq, Q m ,T, Qa, n , 
Qb, as), and want to produce 2D plots for the confidence regions for parameters 
2 and 4, say, marginalized over all other (1,3,5,6) parameters. Invert Fij. Take the 
sub-matrix made by rows and columns corresponding to the parameters of interest 
(2 and 4 in this case) and invert back this sub matrix. 

The resulting matrix, lets call it i?, describes a Gaussian 2D likelihood surface 
in the parameters 2 and 4 or, in other words, the chisquare surface for parameters 
2,4 - marginalized over all other parameters- can be described by the equation 

jt 2 =£(ft-0 i /a W0y-e/ ii -)- (23) 



From this equation, getting the errors corresponds to finding the quadratic equa- 
tion solution % 2 = A% 2 . For correspondence between A% 2 and confidence region 
see the earlier discussion. If you want to make plots, the equation for the elliptical 
boundary for the joint confidence region in the sub-space of parameters of interest 
is: A = 86J2- l 80. 
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11 Example of Fisher approach applications 

Here we are going to consider two cases of application of Fisher forecasts that are 
extensively used in the literature. This section assumes that the reader is familiar 
with basic CMB and large-scale structure concepts, such as: power spectra, error on 
power spectra, cosmic variance, window and selection function, instrumental noise 
and shot noise, redshift space etc. Some readers may find this section more technical 
than the rest of this document: it is possible to skip it, and continue reading from 



11.1 CMB 

The CMB has become the single dataset that give most constraints on cosmology. 
As the recently launched Planck satellite will yield the ultimate survey for primary 
CMB temperature anisotropics, doing Fisher matrix forecasts of CMB temperature 
data may very soon be obsolete. There remain the scope for forecasting constraints 
from polarization experiemnts, however systematic effects (e.g. foreground subtrac- 
tion) will likely dominate the statistical errors (see e.g., l26l for details). It is still 
however a good exercise to see how one can set up a Fisher matrix analysis for CMB 
data. 

If we have a noiseless full sky survey and the initial conditions are Gaussian we 
can write that the signal in the sky ( i.e. the spherical harmonic transform of the 
anisotropics) is gaussianly distributed, we can write the signal as 

s ( = (fl[,flf,af) (24) 

where a e u denotes the spherical harmonic coefficients fro Temperature, E and B 
model polarization. The covariance matrix Cf is then given by 

'CJ T CJ E \ 

Cj E Cf E (25) 
cf/ 

where Q denotes the angular CMB power spectrum. Using Eq.|22]and considering 
that, for rotational invariance, for every I there are (21 + 1 ) modes, it is possible to 
show that the Fisher matrix for CMB experiements can be rewritten as: 

where the matrix which elements are , where X,Y=TT, TE, EE, BB etc., 
is given b)U 



3 I owe this proof to P. Adshead 
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eye} 







1/2[{C\ 



TE\2 



TTrEE] 










(cf) 2 / 



(27) 



Note that this matrix is more complicated that what one would have obtained by 
assuming a a Gaussian distribution for the Q and no correlation between TT TE 
and EE. Nevertheless Eq. [26] is simple enough and allows one to quickly compute 
forecasts from ideal CMB experiments. 

In this formalism effects of partial sky coverage and of instrumental noise can be 
included (approximatively) by the following substitutions: 



Q 



Q+Ni 



(28) 



where N$ denotes the effective noise power spectrum. Note that N( depends on £ 
even for a perfectly white noise because of beam effects. In addition the partial sky 
coverage can be accounted for by considering that the number of independent modes 
decreases with the sky coverage: if denotes the fraction of sky covered by the 
experiment then 

(29) 



11.2 Baryon Acoustic oscillations 

Cosmological perturbations in the early universe excite sound waves in the photon- 
baryon fluid. After recombination, these baryon acoustic oscillations (BAO) became 
frozen into the distribution of matter in the Universe imprinting a preferred scale, 
the sound horizon. This defines a standard ruler whose length is the distance sound 
can travel between the Big Bang and recombination. The BAO are directly observed 
in the CMB angular power spectrum and have been observed in the spatial distri- 
bution of galaxies by the 2dF GRS survey and the SDSS survey |23). The BAO, 
observed at different cosmic epochs, act as a powerful measurement tool to probe 
the expansion of the Universe, which in turns is a crucial handle to constrain the 
nature of dark energy. The underlying physics which sets the sound horizon scale 
(~150 Mpc comoving) is well understood and involves only linear perturbations 
in the early Universe. The BAO scale is measured in surveys of galaxies from the 
statistics of the three-dimensional galaxy positions. Only recently have galaxy sur- 
veys such as SDSS grown large enough to allow for this detection. The existence 
of this natural standard measuring rod allows us to probe the expansion of the Uni- 
verse. The angular size of the oscillations in the CMB revealed that the Universe 
is close to flat. Measurement of the change of apparent acoustic scale in a statisti- 
cal distribution of galaxies over a large range of redshift can provide stringent new 
constraints on the nature of dark energy. The acoustic scale depends on the sound 
speed and the propagation time. These depend on the matter to radiation ratio and 
the baryon-to-photon ratio. CMB anisotropy measures these and hence fixes the os- 
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cillation scale. A BAO survey measures the acoustic scale along and across the line 
of sight. At each redshift, the measured angular (transverse) size of oscillations, A9, 
corresponds with the physical size of the sound horizon, where the angular diameter 
distance is an integral over the inverse of the evolving Hubble parameter, H (z). 
r± = (1 + z)Da(z)80 . In the radial direction, the BAO directly measure the instan- 
taneous expansion rate H(z), through = (c/H(z))Az, where the redshift interval 
(Az) between the peaks is the oscillation scale in the radial direction. As the true 
scales r± and are known (given by r s the sound horizon at radiation drag, well 
measured by the CMB) this is not an Alcock-Paczynsky test but a "standard ruler" 
test. Note that in this standard ruler test the cosmological feature used as the ruler 
is not an actual object but a statistical property: a feature in the galaxy correlation 
function (or power spectrum). An unprecedented experimental effort is undergoing 
to obtain galaxy surveys that are deep, larger and accurate enough to trace the BAO 
feature as a function of redshft. Before these surveys can even be designed it is cru- 
cial to know how well a survey with given characteristic will do. This was illustrated 
very clearly in ll24l . which we follow closely here. We will adopt the Fisher matrix 
approach. To start we need to compute the statistical error associated to a determi- 
nation of the galaxy power spectrum P(k). In what follows we will ignore effects of 
non-linearities and complicated biasing between galaxies and dark matter: we will 
assume that galaxies, at least on large scales, trace the linear matter power spectrum 
in such a way that their power spectrum is directly proportional to the dark matter 
one: P(k) = b 2 PoM{k) where b stands for galaxy bias. At a given wavevector k, the 
statistical error of the power spectrum is a sum of a cosmic variance term and a shot 
noise term: 

P(k) P(k) (M> 

Here n denotes the average density of galaxies and 1 /N is the white noise con- 
tribution from the fact that galaxies are assumed to be a Poisson sampling of the 
underlying distribution. When written in this way this expression assumes that n is 
constant with position. While in reality this is not true for forecasts one assumes that 
the survey can be divided in shells in redshifts and that the selection function is such 
that « is constant within a given shell. Since P(k) is also expected to change in red- 
shift then one should really implicitly assume that there is a z dependence in Eq.[30l 
In general P(k,z) = b(z) 2 G 2 (z)PDM(k) where G(z) denotes the linear growth factor: 
i.e. the bias is expected to evolve with redshift as well as clustering does, not only 
because galaxy bias changes with redshift but also because at different redshifts one 
may be seeing different type of galaxies which may have different bias parameter. 
We do not know a priori the form of b(z) but given a fiducial cosmological model 
we know G(z). Preliminary observations seem to indicate that the z evolution of b 
tends to cancel that of G(z), so it is customary to assume that b(z)G(z) ~ constant, 
but we should bear in mind that this is an assumption. 

An extra complication arises because galaxy redshft surveys use the redshift as 
distance indicator, and deviations from the Hubble flow therefore distort the clus- 
tering. If the Universe was perfectly uniform and galaxies were test particles these 
deviations from the Hubble flow would not exist and the survey would not be dis- 
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torted. But clustering does perturb the Hubble for and thus introduces the so-called 
redshift-space distortions in the clustering measured by galaxy redshift surveys. 
Note that redshft-space distortions only affect the line-of-sigth clustering (it is a 
perturbation to the distances) not the angular clustering. Since these distortions are 
created by clustering they carry, in principle, important cosmological information. 
To write this dependence explicitly: 

P(k,n,z) =b{z) 2 G{z) 2 P D M{k){l+ppL) 2 (31) 

where jj, denotes the cosine of the angle between the line of sight and the wavevec- 
tor. j3 = f/b = d\nG(z)/d\na/b ~ Q m (z) 06 jb. In the linear regime, the cos- 
mological information carried by the redshft space distortions is enclosed in the 
f(z) = j3 (z)b(z) combination. 

For finite surveys, P(k) at nearby wavenumbers are highly correlated, the correla- 
tion length is related to the size of the survey volume: for large volumes the cell size 
over which modes are correlated is (2n) 3 /V where V denotes the comoving survey 
volume. Only over distances in k-space larger than that modes can be considered in- 
dependent. If one therefore wants to count over all the modes anyway (for example 
by transforming discrete sums into integrals in the limit of large volumes) then each 
k needs to be downweighted, to account the fact that all k are not independent. In 
addition one should keep in mind that Fourier modes k and — k are not independent 
(the density field is real-valued!), giving an extra factor of 2 in the weighings. We 
can thus write the error on a band power centered around k, 



a P _ / 2 (\+nP\ 

T- 2K \jvk^SkAn \Hp~J- (32) 

In the spirit of the Fisher approach we now assume that the Likelihood function 
for the band-powers P(k) is Gaussian thus we can approximate the Fisher matrix 
by: 

rkmax dlnPfk) dlnP(k) , c/k 

The derivatives should be evaluated at the fiducial model and V e /f denotes the ef- 
fective survey volume given by 



Vef f (k)=V eff (k,n) = 



n(z)P(k,n) 



n{z)P(k,y.) + \ 



dz - 



nP(k,n) 



nP(k,n) + l 



V (34) 



where n = («(z)). Eq|33]can be written explicitely as a function of k and jj, as 

' rw. _ dmp(k,n)dmp{k,ri k^dkd^ 
ll ~'Jk^ de t dOj 2(2^ ' 



In writing this equation we have assumed that over the entire survey extension 
the line-of-sight direction does not change: in other words we made the flat sky 
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approximation. For forecasts this encloses all the statistical information anyway, 
but for actual data-analysis application the fiat sky approximation may not hold. In 
this equation k m i„ is set by the survey volume: for future surveys where the survey 
volume is large enough to sample the first BAO wiggle the exact value of k m i„ does 
not matter, however, recall that for surveys of typical size L (where L-V 1 / 3 ), the 
largest scale probed by the survey will be corresponding to k = 2n/L. Keeping in 
mind that the first BAO wiggle happens at ~ 150 Mpc the survey size needs to be 
L ^> 150 Mpc for £„„■„ to be unimportant and for the "large volume approximation" 
made here to hold. As anticipated above, one may want to sub-divide the survey 
in independent redshift shells, compute the Fisher matrix for each shell and then 
combine the constraints. In this case L will be set by the smallest dimension of the 
volume (typically the width of the shell) so one needs to make sure that the width 
of the shell still guarantees a large volume and large L. k max denotes the maximum 
wavevector to use. One could for example impose a sharp cut to delimit the range 
of validity of linear theory. In f25l this is improved as we will see below. 

Before we do that, let us note that there are two ways to interpret the parameters 
9n in Eq. fl35l l. One could simply assume a cosmological model, say for example a 
flat quintessence model where the equation of state parameter w(z) is parameterized 
by w(z) = w(0) + w a (l — a) and take derivatives of P{k,ji) with respect to these 
parameters. Alternatively, one could simply use as parameters the quantities H(zi) 
and Da(z/), where Zi denote the survey redshift bins. These are the quantities that 
govern the BAO location and are more general: they allow one not to choose a 
particular dark energy model until the very end. Then one must also consider the 
cosmological parameters that govern the P(k) shape £2 m h 2 , Qj,h 2 and n s . Of course 
one can also consider G(z,-) as free parameters and constrain these either through 
the overall P(k) amplitude (although one would have to assume that b(z) is known, 
which is dicey) or through the determination of G(z) and j3 (z). The safest and most 
conservative approach however is to ignore any possible information coming from 
G(z), j3 (z) or n s and to only try to constrain expansion history parameters. 

The piece of information still needed is how the expansion history information 
is extracted from P(k,jj,). When one converts ra, dec and redshifts into distances 
and positions of galaxies of a redshift survey, one assumes a particular reference 
cosmology. If the reference cosmology differs from the true underlying cosmology, 
the inferred distances will be wrong and so the observed power spectrum will be 
distorted: 

'~)ref_ 
D A (z)LeH{z)ref' 

Note that since distances are affected by the choice of cosmology and k vectors 

are: k refi = H(z)ref /H(z)truek,rue,\\ and KefX = D A(z)true/D A (z)refktme.X- Note 

that therefore in Eqf36]we can write: 



P(*x,*||) = „ rT /_, W*x,*||)- (36) 



Ptrue(k^k h z)=b(z) 2 ll+I3(z" 



k 2 4- k 2 
K true.X^ true\\ 



)' 


[G)(z)l 




[G(Zo)\ 



2 

PDM{k,Zo) (37) 
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Fig. 7 Steps to implement once the Fisher matrix of Eq[35]has been computed to obtain error on 
dark energy parameters. 



where z ( , is some reference redshift where to normalize P(k) typical choices can be 
the CMB redshift or redshift z — 0. Not that from thises equations it should be clear 
that what the BAO actually measure directly is H(z)r s and DA/r s where r s is the 
BAO scale, the advantage is that r s is determined exquisitely from the CMB. 

How would then one convert these constraints on those on a model parameter? 
Clearly, one then projects the resulting Fisher matrix on the dark energy parame- 
ters space. In general if you have a set of parameters 0, with respect to which the 
Fisher matrix has been computed, but you would like to have the Fisher matrix for 
a different set of parameters 0,- where the 0, are functions of the (pi, the operation to 
implement is: 

FMi = ^ F9 ^Ws (38) 

the full procedure for the BAO survey case is illustrated in Fig. [7] The slight 
complication is that one starts off with a Fisher matrix (for the original parameter 
set Of) where some parameters are nuisance and need to be marginalized over, so 
some matrix inversions are needed. 

So far non-linearities have been just ignored. It is however possible to include 
then at some level in this description. l25l proceed by introducing a distribution of 
gaussianly distributed random displacements parallel or perpendicular to the line of 
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Fig. 8 Percent error on H(z)r s and D a /r s as a function of the galaxy number density of a BAO 
survey. This figure assumes full sky coverage = 1 (errors will scale like 1 / ^/f s k y ) and redshift 
range from z = to z = 2 in bins of Az = 0. 1 . 



sight coming from non-linear growth (in all directions) and from non-linear redshift 
space distortions (only in the radial direction). The publicly available code that im- 
plements all this (and more) is at: 

http://cmb.as.arizona.edu/~eisenste/acousticpeak/bao_forecast.html In order to use 
the code keep in mind that |25l model the the effect of non-linearities is to convolve 
the galaxy distribution with a redshift dependent and ji dependent smoothing kernel. 
The effect on the power spectrum is to multiply P(k) by exp[— k 2 E (k,jx)/2], where 
E(k,jj.) = Ej_ — ll 2 {E 2 — E 2 ). As a consequence the integrand of the Fisher matrix 
expression of Eq. ( f3fTb is multiplied by 

exp[-k 2 zi-k 2 n 2 (Z 2 -i;l)} (39) 

where, to be conservative, the exponential factor has been taken outside the deriva- 
tives, which is equivalent to marginalize over the parameters E^ and E± with large 
uncertainties. 

Note that E\\ and E± depend on redshift and on the chosen normalization for 
fi)M(i). In particular: 

E ± [z) = E G(z)/G(z ) (40) 
E ]] (z)=E G(z)/G(z )(l+f(z)) (41) 
Eq oc (J 8 . (42) 

If in your convention zq = then Eq(z = 0) = %.6h^ 1 g%.dm{z = 0)/0.8. 

As an example of an application of this approach for survey design, it may be 
interesting to ask the question of what is the optimal galaxy number density for a 
given survey. Taking redshifts is expensive and for a given telescope time allocated, 
only a certain number of redshifts can be observed. Thus is it better to survey more 
volume but have a low number density or survey a smaller volume with higher 
number density? You can try to address this issue using the available code. For a 
cross check, figure[8]show what you should obtain. Here we have assumed Og = 0.8 
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Fig. 9 Effect of the choice of the cosmological model in the recovered values for the parameters. 
Here we used WMAP5 data only: in both panels the black line is for a standard flat LCDM model. 
In the left panel we show the posterior for Qi„ the red line is for a non-flat LCDM model. In the 
right panel we show the posterior for n s : the red line is for a LCDM model where the primordial 
power spectrum is not a perfect power law but is allowed to have some "curvature" also called 
"running" of the spectral index. Figure courtesy of LAMBDA (7). 

at z = 0, b(z = 0) = 1 .5 and we have assumed that G(z)b(z) = constant. To interpret 
this figure note that with the chosen normalizations, P(k) in real space at the BAO 
scale k ~ 0.15 h/Mpc is 6241(Mpc/h) 3 , boosted up by large scale redshift space 
distortions to roug hly 10 4 (Mpc/h) 3 so n = 10 4 corresponds to nP(k = 0.15) = 1. 
Note that the "knee" in this figure is therefore around nP = I. This is where this 
"magic number" of reaching nP ^> 1 in a survey comes from. Of course, there are 
other considerations that would tend to yield an optimal nP bigger than unity and of 
order of few. 



12 Model testing 

So far we have assumed a cosmological model characterized by a given set of cos- 
mological parameters and used statistical tools to determine the best fit for these pa- 
rameters and confidence intervals. However the best fit parameters and confidence 
intervals depends on the underlying model i.e. what set of parameters are allowed 
to vary. For example the estimated value for the density parameter of baryonic mat- 
ter £2/, changes depending wether in a LCDM model the universe is assumed fiat 
or not (Fig(9] right panel) or the recovered value for the spectral slope of the pri- 
mordial power spectrum changed depending if the primordial power spectrum is 
assumed to be a power law or is allowed to have some "curvature" or "running" 
(Fig(9]left panel). It would be useful to be able to allow the data to determine which 
combination of parameters gives the preferred fit to the data: this is the problem of 
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model selection. Here we start by following Ifl6l which is a clear introduction to 
the application of this subject in cosmology. Model selection relies on the so-called 
"information criteria" and the goal is to make an objective comparison of different 
models which may have a different number of parameters. The models considered 
in the example of Fig, [9] are "nested" as one model (the LCDM one) is completely 
specified by a sub-set of the parameters of the other (more general) model. In cos- 
mology one is almost always concerned with nested models. 

Typically the introduction of extra parameters will yield an improved fit yo the 
data set, so a simple comparison of the maximum likelihood value will always favor 
the model with more parameters, regardless of wether the extra parameters are rele- 
vant. There are several different approaches often used in the literature. The simplest 
is the likelihood ratio test ifTSIl see S|6] Consider the quantity 2ln[^f S i mp i e / J^ comp i ex ] 
where J£ S impU denotes the maximum likelihood for the model with less parameters 
and ^complex] the maximum likelihood for the other model. This quantity is approx- 
imately chisquare distributed and thus the considerations of sec|4]can be applied. 

The Akaike information criterion (AIC) ifTTl is defined as: AIC = — 21nJzf + 
2k where _S? denotes the maximum likelihood for the model and k the number of 
parameters of the model. The best model is the one that minimizes AIC. 

The Bayesian information criterion (BIC) lfl8l is defined as: BIC = — 21nJz? + 
klnN where N is the number of data points used in the fit. 

It should be clear that all these approaches tend to downweight the improvement 
in the likelihood value for the more complex model with a penalty that depends on 
how complex is the model. Each of these approaches has its pros and cons and there 
is no silver bullet. 

However it is possible to place model selection on firm statistical grounds within 
the Bayesianapproach by using the Bayesian factor which is the Bayesian evidence 
ratio (i.e. the ratio of probabilities of the data given the two models). 

Recalling the Bayes theorem (EqEJ we can write: &>{D) = & > (D\M i )&>{M i ) 
where ; runs over the models M we are considering. Then the Bayesian Evidence is 

@>{D\Mi) = J dQ&>(D\Q,Mi)&{Q\M{) (43) 

where &(D\6 ,Mj) is the likelihood. Given two models (/ and J), the Bayes factor is 

,J ~ <Z*(D\Mj)- ( } 

A large B, 7 denotes preference for model i. In general this requires complex numer- 
ical calculations, but for the simple case of Gaussian likelihoods it can be expressed 
analytically. The details can be found e.g. in |fl9l and references therein. For a di- 
dactical introduction see also 
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13 Monte Carlo methods 

With the recent increase in computing power, in cosmology we resort to the appli- 
cation of Monte Carlo methods ever more often. There are two main applications 
of Monte Carlo methods: Monte Carlo error estimations and Markov Chains Monte 
Carlo. Here I will concentrate on the first as there are several basics and details 
explanations of the second (see e.g. 12D and references therein). 

Lets go back to the issue of parameter estimation and error calculation. Here is 
the conceptual interpretation of what it means that en experiment measures some 
parameters (say cosmological parameters). There is some underlying true set of pa- 
rameters Qtrue that are only known to Mother Nature but not to the experimenter. 
There true parameters are statistically realized in the observable universe and ran- 
dom measurement errors are then included when the observable universe gets mea- 
sured. This realization gives the measured data Do . Only Do is accessible to the 
observer (you). Then you go and do what you have to do to estimate the parame- 
ters and their errors (chi-square, likelihood, etc.) and get 9o- Note that Do is not a 
unique realization of the true model given by 9 true : there could be infinitely many 
other realizations as hypothetical data sets, which could have been the measured 

one: D2,D2,D3... each of them with a slightly different fitted parameters Q\, 02 

00 is one parameter set drawn from this distribution. The hypotetical ensamble of 
universes described by 0, is called ensamble, and one expects that the expectation 
value (Gi) = G true . If we knew the distribution of 0; — G trU e we would know every- 
thing we need about the uncertainties in our measurement Gq . The goal is to infer 
the distribution of Gj — Gtrue without knowing G lrue . Heres what we do: we say that 
hopefully Go is not too wrong and we consider a fictitious world where Gq was the 
true one. So it would not be such a big mistake to take the probability distribution 
of Gj — Go to be that of 0, — Q true . In many cases we know how to simulate 0, — 0o 
and so we can simulate many synthetic realization of "worlds where 0o is the true 
underlying model. Then mimic the observation process of these fictitious Universes 
replicating all the observational errors and effects and from each of these fictitious 
universe estimate the parameters. Simulate enough of them and from Qf — Gq )where 
S stands for "synthetic" or "simulated") you will be able to map the desired multi- 
dimensional probability distribution. With the advent of fast computers this tech- 
nique has become increasingly widespread. As long as you believe you know the 
underlying distribution and that you believe you can mimic the observation replicat- 
ing all the observational effects this technique is extremely powerful and, I would 
say, indispensable. This is especially crucial when complicated effects such as in- 
strumental and or systematic effects can be simulated but not described analytically 
by a model. 



30 

14 Conclusions 



Licia Verde 



I have given a brief overview of statistical techniques that are frequently used in the 
cosmological literature. I have presented several examples often from the literature 
to put these techniques into context. This is not an exhaustive list nor a rigorous 
treatment, but a starter kit to "get you started". As more and more sophisticated 
statistical techniques are used to make the most of the data, one should always re- 
member that they need to be implemented and used correctly: 

• data gathering is an expensive and hard task: statistical techniques make possible 
to make the most of the data 

• always beware of systematic effects 

• an incorrect treatment of the data will give non-sensical results 

• there will always be things that are beyond the statistical power of a given data 
set 

Remember: "Treat your data with respect!" 



15 Some useful references 

There are many good and rigorous statistics books out there. In particular Kendall's 
advanced theory of statistics made of three volumes: 

• Distribution theory (Stuart & Ort 1994) 03) 

• Classical Inference (Stuart & Ort 1991) fH) and 

• Bayesian Inference (O'Hagan 1994) lfl5l . 

For astronomical and cosmological applications in many cases one may need a prac- 
tical manual rather than a rigorous textbook. Although it is important to note that a 
practical manual is no substitute for a rigorous introduction to the subject. 

• Practical statistics for Astronomers, by Wall & Jenkins, (2003) is a must have 

ID- 

• Numerical Recipes is also an indispensable "bible": Press et al (1992)|6| 

It also provides a guide to the numerical implementation of the "recipes" discussed. 
Complementary information to what presented here can be found in 

• Verde, in XIX Canary Island Winter School "The Cosmic Microwave Back- 
ground: from Quantum fluctuations to the present Universe" 1211 . In the form 
of lecture notes, and 

• Martinez, Saar, "Statistics of the galaxy distribution" l22l . with a slant on Large 
scale structure and Data Analyais in Cosmology, Martinez, Saar, Martinez- 
Gonzalez, Pons-Porteria, Lacture Notes in Physics 665, Springer, 2009 
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